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Abstract 

In this paper we propose a prior distribution for the clique set and dependence 
structure of binary Markov random fields (MRFs). In the formulation we allow both 
pairwise and higher order interactions. We construct the prior by first defining a prior 
for the neighbourhood system of the MRF, and conditioned on this we define a prior 
for the appearance of higher order interactions. Finally, for the parameter values we 
adopt a prior that allows for parameter values to equal, and in this way we reduce the 
effective number of free parameters. To sample from a resulting posterior distribution 
conditioned on an observed scene we construct a reversible jump Markov chain Monte 
Carlo (RJMCMC) algorithm. We circumvent evaluations of the intractable normalising 
constant of the MRF when running this algorithm by adopting a previously defined 
approximate auxiliary variable algorithm. We demonstrate the usefulness of our prior 
in two simulation examples and one real data example. 
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1 Introduction 


MRFs are a well u sed model class in spatial statistics, see iKinderma/nn and Snell! (119801 ) and 


Hurn et al. 


(2003) for an introduction. In this paper we consider binary MRFs which is a 


subclass of discrete MRF. An MRF x is usually defined by conditioning on a parameter vector 
so that we have p(x\<p). This (f models interactions between the components of x, and 


in app l icatio n s these inter a ctions are typically limited on l y to 


Besagl ( 19741 ) . iBesagl ( 19861 ). iFriel and Ruel ( 20071 ). 


Everitt 


De pairwis e, see for instance 


(2012) and 


McGrorv et al. 


( 20121 ). 


Mainly three arguments are made for constraining these models to include only pairwise 
interactions. Firstly, the interpretation of the parameters of higher order interaction can 
be difficult, for instance assigning values to the parameters in order to give the MRF some 
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desired property can be a hard task. Secondly, if one allows for higher order interactions, 
the number of parameters grows exponentially as a function of the number of variables in 
a clique. And thirdly, because of the intractable normalising constant, binary and discrete 
MRFs are a computationally intensive class of models, especially if parameter inference is 
required for large dependency struc tures with higher order interactions . It has however been 


shown, see 


Descombes et al. 


(1995) and iT ielmeland and Besagl (1199811 . that for many data 


sets higher order interaction models outperforms pairwise models, for instance when an MRF 
is used as a prior distribution when recovering a noisy image. In those papers reasonable 
interpretation of higher order interactions are in fact given, and computations are carried 
out in reasonable time. In both the above papers, the number of parameters are reduced by 
manually setting the same value to groups of parameters. Only the second of the two papers 
are however performing parameter inference, which is made possible by approximating the 
maximal likelihood estimator obtained by applying a Markov chain Monte Carlo (MCMC) 
technique to estimate the normalising constant (jGeyer and Thom pson 199 2). Other, and 


more recent approximation techniques to circumvent the computational pro 


oleins associated 


and 


McGrorv et al. 

(2012 

). Note however that 

Austad 

(2011) 


A ustadl ( 20111 ). 


Everitt 


( 20121 ) 


just mentioned which performs parameter inference for MRFs with higher order cliques. We 
discuss options for handling the normalising constant later in the paper. 

When assigning a prior to the interaction structure of an MRF this can be done on three 
different levels. First one can assume the neighbourhoods and the parametric form of the 


MRF to be given, and assig n a p rior distribution for 


'his is for instance done in 


H eikkinen a nd Hogmand er ( 199411 and 


he v alues of th e par am eters o nly. 


(119981 ). A second level of prior formulation is adopted in 


Rvden and Titterington 


Arnesen and Tjelmeland (2015). 


The neighbourhoods are still fixed, but here priors are formulated for both the parametric 
form of the MRF and for the assoc iated parameter values. An interpretational parametri- 
sation inspired by ITielmeland and Besagl (119981) is used, and to reduce the effective number 
of free parameters without reducing the flexibility of the MRF one adopts a prior where 
groups of parameters are allowed to have exactly the same value. In this article we go to the 
third level of prior formulation by assigning a prior also for the neighbourhood system of the 
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MRF, and to our knowledge this is the first attempt to learn the nei 
MRF from observed data by such a fully Bayesian approach. As in 


g hbourhood system of an 


Arnesen and Ti elmela nd 


([2015) we keep the prior property that groups of parameters are allowed to be assigned the 


same parameter value, but we need to adopt a new parametrisation of t 


re MRF. To obtain 


posterior samples we develop an RJMCMC sampling algorithm (Green 


1995), and to cope 


(Murrav et al. 

2006 

) of 

Everitt 

o _ zr 

(2012) 


Assigning a 

in the theory c 

prior distribution 

)f graphical mode 

to the 

ling f< 

dependence structure of 

3r categorical data, see 

x is a 

spiege 

much used ar: 

halter et al. 

jproach 

(1993), 

Madigan et al. 

(1995 

), 

Lauritzen 

(1996) 

. DcllaDortas and Forster 

(1999 

) and 

Massam et al. 


(120091 ). but the problems considered there are often small enough so that the normalising 
constant is easily calculated. Also, the variables in such problems often represent features 
of different nature, whereas we assume translational invariance of x similar to all the other 
work on MRFs mentioned above. For graphical models this lack of invariance is compensated 
for by multiple observations of x, whereas we consider situations where only one scene x is 


observed. For more details regarding 


is referred to 


Arnesen and Tielmeland 


he link between these two type of models the reader 
(120151 1. 


The paper is organised as follows. In the next section, Section [2l we introduce notation 
and define some properties for the nodes in a rectangular lattice, while in Section [3] we assign 
binary random variables to these nodes and discuss two possible parametrisation of their joint 
probability distribution. In Section [4] we let this probability distribution be a stationary 
MRF, and using both parametrisations we construct in Section [5] a prior distribution for 
the dependence structure and the parameter values of an MRF. In Section [6] we present the 
proposal distributions we use when sampling from the posterior distribution along with a 
discussion on how to handle the normalising constant. We investigate three examples in 
Section H and lastly we give some closing remarks in Section [8j 
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2 Preliminaries 


In this section we consider a rectangular lattice with periodic boundary conditions and 
introduce the notation we need in later sections to define a prior for stationary MRFs on 
such a lattice. 

Consider a rectangular m x n lattice and let (i,j), where i G {0,1,..., m — 1} and j G 
{0,1,..., n — 1}, denote an arbitrary node in this lattice. Here we adopt the convention that 
i and j specify the vertical and horizontal positions of the node in the lattice, respectively, 
that % = 0 is at the top of the lattice and i — m — 1 at the bottom, and that j = 0 and 
j — n — 1 are at the left and right ends of the lattice, respectively. To denote sets of nodes 
we use lower case Greek letters, and in particular we denote the set of all nodes in the lattice 
by x = {(z, j); i = 0,..., m — 1, j = 0,..., n — 1} and use A and A* to denote arbitrary subsets 
of x, i.e. A, A* C x- In much of what follows we assume the lattice to have torus boundary 
conditions. 

Definition 1 If, for a rectangular lattice x = {(*, j)', i — 0,..., m — 1, j — 0,..., n — 1}, the 
translation of a node ( i,j ) G x with an amount ( t,u ) Gx*s defined to be 

( i,j ) © (t, u) — (i + t mod m, j + u mod n), 
we say that the lattice has torus boundary conditions. 

The translation of each node in a set A C x by an amount (t, u ) G x we denote by A© (t, u ) = 
{(i,j) © (t,u); (i,j) G A}. 

To denote sets of subsets of x we use upper case Greek letters, and at the next level 
we denote sets of sets of subsets of x by upper case Roman letters. In particular we let 
Q(x) = {A; A C x} denote the set of all subsets of x and use A, A* C H(x) to denote 
arbitrary subsets of f2(x)- The H(x) is often called the power set of x, arid one should note 
that in particular it includes the empty set and x itself. We let L denote the partition of 
Q(x) we get by identifying all subsets of x that are translations of each other. 

Definition 2 Let L be the partition of H(x) where, for any two distinct subsets of nodes 
A, A* C x, there exists a A G L so that {A, A*} C A if and only if there exists a (t, u) G x s ° 
that A* = A © (t, u ). 
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The elements in the partition we call cells. One can note that two of the cells in L consist 
of only one element each, namely {0} and {x}, whereas all other cells have mn elements. 
One example of a cell with mn elements is {{ (i, j)}; (z, j) G x}, and another example is 
{{(i,j),(i,j) © (1,0)}; (i,j) G x}- O ne can also note that all elements in a cell A G L 
contains the same number of nodes, and we denote this number by r(A). For instance we 
have t({{(*, j)}; (■ i,j) G x» = 1 and ^({{(bj), (bj) © (M)}; (bi) e A» = 2. 

As all elements in a cell A G L are translations of each other we may specify a cell by 
specifying the relative positions of the nodes in the elements in that cell. This suggests 
a more intuitive notation for the cells, where one represents each node in an element of 
a cell by a box and arranges the boxes according to the relative positions of these nodes. 
For instance we get □ = {{(*, j)}; (i,j) G x}, B = {{(bj), (b j) © (1,0)}; (i,j) G x}, and 
cB = {{(b j), (b j) © (1, -1), (bj) © (1,0)}; (i,j) G x}- In the following we use this box 
representation to refer to specific cells in L, and use the A G L notation whenever we need 
to refer to a generic cell. 

For two distinct cells A, A* G L we let N(A,A*) denote the number of elements A* G A* 
that is a strict subset of an (arbitrary) element A G A, i.e. 

AT (A, A*) = card({A* G A; A* C A for some A G A}), 

where card(A) denotes the cardinality, or the number of elements, in the set A. That this 
number is the same for all A G A follows from the fact that AgA<^A®(1,m)gA for any 
(t, u) G X- Using the box representation of the cells in L it is easy to see what N(A,A *) 
becomes. We have for instance AT ([],□) = 2, AT^n) = 3, A^^g) = 1 and A^(n,g) = 0. 
We use whether or not some elements in a cell A* G L is a subset of an element in another 
(arbitrary) cell in A G A to define a partial ordering of the cells in L. 

Definition 3 For any A, A* G L we define 

A* © A^ N(A,A*) > 0. 

For instance, we have □ -< B, □ -< □B and Q cB, but m/0 and 0 / □. Clearly we also 
have { 0 } -< A for any AgL \ { 0 }. 
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Figure 1: DAG representation of a dense set Met. 

In Section [4] we consider MRFs defined on the lattice x and use a subset of the cells in L 
to specify the set of cliques for the MRF. We denote such a subset by M C L, and use the 
term clique types for the elements in M. The cliques of the MRF will thereby be all A G A 
for all A G M, and will say that A G A is of clique type A for A G M. We also say that a 
A G L is on if A G M, and otherwise A is off. We also define a prior for M and then restrict 
the attention to sets MCL that are dense. 

Definition 4 A set of cells M C L is said to be dense if {A* Gf;AWA}C M for all 
A G M. 

We visualise the elements in a dense set M CL as a directed acyclic graph (DAG), where the 
DAG has one vertex for each cell A G M and an edge from the vertex representing a A* G M 
to the vertex representing a A G M whenever A* -< A and r(A*) = t(A) + 1. For instance, 
Figure Q] shows such a DAG representation for the dense set of cells 

In the next section we present two parametrisations of binary stationary distributions defined 
on x- 


3 Binary stationary distributions on a torus 

Consider a rectangular lattice x — {(b j); i = 0,..., m — 1, j = 0,..., n — 1} with torus 

boundary conditions as in Sectional Now, to each node (i,j) G x we associate a binary 
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variable x h j G {0,1}, and let x = (x it j] ( i,j ) G y) denote a vector of all these binary variables. 
Associating the value 0 with white and the value 1 with black we can think of an x as a 
colouring of the lattice y. A colouring where the nodes in a set A C y are black and all other 
nodes are white we denote by 1 A , i.e. 1 A = (x t ,j = ) G A); (i,j) G y), where /(•) is the 

indicator function that equals unity whenever the argument is true and zero otherwise. 

We denote a distribution for x by p(x) and assume the positivity condition p(x) > 0 for 
all x G {0, l} mn . Thus, we can write 

p(x) = Zexp{I/(a:)}, (1) 


where Z is a (typically computationally intractable) normalising constant and —U(x) is 
often referred to as the energy function of x. In the following our focus is on stationary 
distributions p(x). 


Definition 5 A random vector x defined on a rectangular lattice y with torus boundary 
conditions is said to be stationary if and only if p( 1 A ) = p(l A ®G u )) for all A G ff(y) and 
(t,u) G y. 

In the remainder of this section we consider two representations of U(x) and in particular 
discuss the consequences of imposing the stationarity assumption. In Section 0] we use each of 
these two representations as a basis to define a corresponding parametrisation for stationary 
binary MRFs. The first representation is defined using pseudo-Boolean function theory, 


while the second representation is def i ned c 


irectly on Ufx). 


Following Tj elmeland and A ust adl (2012), we note that U{x) is a pseudo-Boolean function 
and thereby can be expressed as 


u(x) = y, n 


X 


1 , 3 - 


( 2 ) 


Aef2(x) (i,i)e A 

We refer to the 6 parameters as interactio n parameters. More details o n ps eudo-Boolean 


functions and their properties can be found in 


Hammer and Holzmanl (1902) and 


Grabisch et al. 


(120001 ). Using Definition [5] above, we state a theorem that identifies what restrictions the set 
{/3 A ; A G D(y)} must fulfil for p{x) to be stationary. 
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Theorem 1 A random vector x defined on a rectangular lattice x — = 0 — 

1, j — 0,..., n — 1} with torus boundary conditions is stationary if and only if (3 X = /3 A ®C«) 
for all A G O(x) and ( t,u ) G x- We then say that /3 X is translational invariant. 


The result in this theorem is a special case of Theorem 2 of [Arnesen and Tje lmelandl (120151 ). 


The theorem above states that two interaction parameters /3 A and /3 A must have the same 
value whenever A and A* belong to the same cell A G L in the partition of f2(y) defined in 
Section [2j Thus, for x to be stationary /3 A must take the same value for all A G A for each 
A G L, and we denote the common value of (3 X , for all A G A, by /3 A . Using the box notation 
introduced above we then in particular have that /3 d , /fi and /3cB are the common values for 
all (3 X for A G □. A G 0 and A G c0, respectively. 

The second representation we consider for U(x) is the quite naive one obtained by just 
representing the value of U(x) for each possible vector x. For each A G f2(y) we introduce a 
parameter fi x and write 

U(l x ) = </> A , A G 0( x ). (3) 


The stationarity assumption puts similar restrictions on the (j) parameters as it did for the 
/3 parameters. 


Theorem 2 A random vector x defined on a rectangular lattice x — {(*, j)', i = 0,...,m — 
1, j — 0,..., n — 1} with torus boundary conditions is stationary if and only if fi x = (f> x ®^ u ) 
for all A G O(x) and ( t,u ) G X- We then say that <f x is translational invariant. 


Proof of Theorem 2 Combining flf) and (f3jj we get 

p(l x ) = Z exp {fi x }, 
for all A G O(x). Thus we have that 

cj) x = 0 A ®U“) p( i A ) = p( i A ©(t,«)) 

for any A G U(x) and ( t,u ) G x> an( d the result follows. 


Corresponding to what we have for the /3 parameters, we thus have that for x to be stationary 
we must have that <f> x must take the same value for all A G A for each A G L. We therefore 





also introduce a notation corresponding to what we did for the fi parameters, and denote the 
common value of <f> x for all A G A by 0 A . In particular we then have that (jP is the common 
value for all A G □ and correspondingly c/fl and (fZ B for all A G 0 and all A G [f|, respectively. 
It should be noted that even though we use a similar notation for the (3 and 0 parameters, 
the correspondence is not as strong as our notation may suggest. For a given colouring x of 
the nodes in y, the associated 0 parameter is a function of all the components of x, whereas 
the associated [3 parameter is only a function of a subset of the nodes. The next theorem 
states a one-to-one correspondence between {/3 A ; A G L } and {0 A ; A G L}. 

Theorem 3 Consider a stationary random vector x defined on a rectangular lattice y — 
= 0,..., m — 1 ,j = 0,... ,n — 1} with torus boundary conditions. Then there is a 
one-to-one correspondence between {0 A ; A G L} and {0 A ; A G L}. 


Proof of Theorem 3 From (J2J) and ([3]) we have that 


i x = Y. ^ II hi= p + Y ^ f° r aU A e n <x>- 

A*ef2(x) (i,j)e A* 


(4) 


A*cA 


By using Moebious inversion (Laur itze n 1 99 q) it follows from this that 


y = ^(-i) 


card(A)—card(A*) — 1 ± A* 


(5) 


A*cA 


If we invoke the translation invariance for both representations we get from (J4]) that 


0 A = N ( A . A*)/3 a *. (6) 

A*^:A 

We see that we can use this expression to compute all {0 A ;A G L} from {0 A ; A G L} 
recursively. Firstly we can compute the 0 A for which r(A) = 0, i.e. . Next 

we can calculate the 0 A for which r(A) = 1, thereafter all 0 A for which r(A) = 2, and so 
forth until 0 A have been calculated for all A G L. Thus, {0 A ; A G L} is uniquely specified by 
{/3 A ; A eL}. 

If we again invoke the translation invariance for both representations we get from (|5j) 


= 0 A - (—1) card(A) _card (A * ] “ 1 fV (A, A*) 0 A *, 

A*^:A 


(7) 
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where we correspondingly get that {/3 A ; AeL} can be recursively computed from {0 A ; A G L}. 
Again one must first compute f3 A for r(A) = 0, i.e. j3^, then j3 A for which r(A) = 1, 
thereafter all /3 A for which r(A) = 2, and so forth. Thereby {0 A ; A 6 L} is also uniquely 
specified by {0 A ; A G L} and the proof is complete. 

In the next section we present stationary binary MRFs, and use the two representations 
introduced above to establish a parametrisation of MRFs that is suitable for a fully Bayesian 
modelling. 


4 Binary MRFs 


In this section we first give a brief introduction to binary MRFs (ICressie 


1993 


.limn et ah 


2003 ). Thereafter we assume the MRF to be stationary and defined on a lattice with torus 


boundary conditions, and discuss what consequences the Markov assumption then has for 
the /3 and 0 representations introduced above. Finally we use each of the two representations 
to define corresponding parametrisations of an MRF. 

Consider a rectangular lattice x — {(b j)‘, i = 0,..., m — 1, j = 0,. .., n — 1} as in Section 
[2]and a corresponding random vector x = (xij\ ( i,j ) G x) where x % , 3 G {0,1}. In addition to 
the notation introduced above we let, for A C y, x\ = (xij) ( i,j ) G A) denote the collection 
of random variables associated with the nodes in A, and let x^uj) = x x\{(i,j)}- A binary MRF 
is dehned with respect to a so-called neighbourhood system T = ( i,j ) G x} where 

'ip(i.j) C x\{(0 j)} and (i,j) G if(t, u ) O (C M ) £ V’h.j)- The random vector x is then said to be 
an MRF with respect to T if p(x) > 0 for all x G {0, l} mn and it fulfils the Markov property 


P(xij\x-(ij)) =p{xi,j 

for all (i,j) G x an d x G {0, l} nm . Given a neighbourhood system T, a set of nodes A C y is 
said to be a clique if (i,j) G for all distinct pair of nodes ( t,u ) G A. In particular 

one can note that the empty set 0 is always a clique, and so is {(i,j)} for all (i,j) G x- We 
denote the set of all cliques by T, and let T max denote the set of all maximal cliques, where 
a clique is said to be maximal if it is not a subset of another clique. Clearly T C A2(y), 
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T max C 0(x) and T max C T. The Hammersley-Clifford theorem (Cli fford 19901) then states 
that x is an MRF with respect to T if and only if U(x) can be expressed as 


U(x)= Y. ^(*»)> (») 

Aex max 

where V\(x\) is referred to as the potential function for the maximal clique A G T max . 
Combining (J2]) and (JHJ) it follows that /3 X = 0 for all A ^ Y. Thus, U(x) for an MRF is given 
by 


mx)=Y,p x n 


x. 


i,3i 


(9) 


see 


Tjelmeland and Austadl ( 20121 ) for a formal proof. 


AST A 


In the following we assume x to be defined on a lattice with torus boundary conditions, 
to be stationary, and to be an MRF with respect to a neighbourhood system T. Theorem 
[Tjthen gives that (3 X must be translational invariant, and in particular we must then have 
/3 X = 0 /3 A ®^’“) = 0 for A e Q(y), (t,u) G y. Without loss of generality we can thereby 

assume also the clique system to be translational invariant in that 


A G T A © (t, u) G T for all A G 0(y), (t, u) G x- 


( 10 ) 


In turn this implies that the neighbourhoods are translational invariant in that = 

i/>(ij) © (t,u) for all (i,j), ( t,u ) G y. From (ITOl) it follows that we can find an M C L so 
that T = (J 4gM A, and again using that /3 X is translational invariant we get that (J9J) can be 
reformulated as 

u ( x )=yi n Xi >i ■ 

Asm agA ( ij )ga 

Thus, a stationary MRF defined on a lattice with torus boundary conditions is specified by 
the set M and the values of the parameters {/3 A ,A G M}. A natural and frequently used 
assumption in statistics is to allow higher-order interactions in a model only if corresponding 
lower-order interactions already are present. In the formulation of the MRF discussed above 
this means that if A* -< A we can allow A G M only if A* G M, i.e. that M is dense. In 
the following we therefore restrict M to be dense. In addition we restrict, without loss of 
generality, that {0} always is contained in M. 
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Introducing for instance the constraint j3^ = 0, the linear independence of the products 
of x in (TlTl) makes the corresponding p(x) an identifiable parametric model. In a fully 
Bayesian model formulation one would then need to put a prior on M and the parameter 
set {/3 A , A e M\ {{ 0 }}}. However, the f3 parameters relate to cliques of different sizes and 
this makes the interpretation difficult. In particular we think the values of (3 parameters 
associated with large cliques apriori should be expected to be closer to zero than values of 
parameters associated with smaller cliques, but we are not able to make this statement more 
precise than this. In the following we fold an alternative parametrisation based on the 0 
representation discussed above and argue that this makes it easier to formulate a reasonable 
prior. 

Given a stationary MRF defined on a lattice with torus boundary conditions specified 
by M and {(3 A ; A G M} and recalling that f3 A = 0 for A G L \ M, we can use © to 
fold the 0 representation {0 A ; A G L} of the model. In contrast to the situation for the 
/3 parameters, the Markov assumption does not induce any particular zero structure for 
the 0 parameters. However, recalling that we have assumed M to be dense, we easily 
see from ([6]) and (JT)) that there is a one-to-one correspondence between {/3 a ;A G M} and 
{0 A ;A G M}. An alternative to the (3 parametrisation discussed above is therefore to 
parametrise the model by M and {0 A ; A G M}. One could note that then the remaining 0 
parameters, {0 A ; A G L \ M}, are given by (jSJ) and ([7]) as linear functions of {0 A ; A G M}. 
From the definition of the 0 parameters in (j3j) we see that they essentially just specify the 
probabilities for different realisations of x. If no specific prior information is available and 
suggests otherwise it therefore seems reasonable to assign a prior to {0 A ; AgM} where the 
various parameters are exchangeable. 

5 Prior distribution 

When constructing our prior distribution for the parameters in an MRF we have three 
properties in mind. Firstly, we want to assign a positive prior probability to the event that 
A G L is on or off. This is naturally done in the [3 parametrisation, but Theorem [3] still 
enables us to work with the 0 parameters. Secondly, we want a positive prior probability 
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for the event that groups of clique types have exactly the same parameter value, and thirdly 
we want to assign a prior distribution to the parameter values of the MRF. As discussed 
in the last two paragraphs in the previous section, the two last aspects are most naturally 
formulated with the 0 parametrisation. Thus, the main focus will be on the 0 parametrisation 
for the reminder of this paper, and in the following we describe how we define a prior so that 
all the three properties discussed above are obtained. 

For the first property discussed above we note that setting a A G L to be off is equivalent 
to setting 

0 A = ( _ 1 ) card(A)-card(A*) ~ 1 j\T ( A, A* ) 0 A * , 

AM A 

by (J7J). Recalling that L is required to be dense, we have that if A is on, all A* G L where 
A* -< A must be on as well. As in the previous section we let M C L denote the set of all 
clique types in L that is on. Moreover, for integers k > 0 we let = {A 6 M; r(A) = k} 
be the set of all fc’th order clique types in M. As mentioned above we restrict M always 
to contain { 0 }, so clearly M 0 = {{ 0 }} and one must have Mi = 0 or M x = {□}. Moreover, 
one should note that M 2 is one-to-one with the neighbourhood system T, and because of 
symmetry in the neighbourhoods card(0(iy)) = 2-card(M 2 ) for all (i,j) G x■ We assume the 
prior distribution for M, p(M), to have the form 

p(M)=p(M 2 )p(M 1 \M 2 )p({M k ,k> 3}| M 1 ,M 2 ) 

where we in the following carefully explain the three factors in this expression in turn. 

The p(M 2 ) is essentially a prior distribution for the neighbourhood system T, and for an 
arbitrary node (i,j) G x we assume apriori the various (t,u) € x\ {(*, j)} to be a neighbour 
of (i,j) independently of each other, and with probabilities so that nodes close to (i,j) have 
the highest probabilities of being a neighbour. More precisely, we define p(M 2 ) to have the 
form 

P (m 2 ) = [/r?(A) /(AeM2) (i - um I{HM2) ], 

AeL:r(A)=2 

where f v ( A) G [0,1] is the probability for a specific node to be a neighbour, and r) is a 
parameter controlling this probability. In this paper we use 

/„( A) = 

13 


485.75 - 



24.25 - 


5.51 - 


0.99 - 


0.09 - 


Figure 2: For a 100 x 100 lattice, the expected number of neighbours to a node, for given 
value of rj, 2 • E[card(M 2 )|r 7 ], as a function of ?/. 


where d( A) is the Euclidean distance between the two nodes in (an arbitrary) A G A when 


r(A) = 2. We consider rj as random and want to assign a prior distribution to it. For a given 
value of rj we can easily compute the expected number of neighbours of a node, see Figure [2] 
As we can also understand from the definition of p(M 2 ), the expected number of neighbours 
is very high for small values of r) and decreases rapidly as 7] increases. Apriori we do not 
want high probability for the nodes to have a lot of neighbours, so the apriori density for ij 
should be small for values of p close to zero, but otherwise we want the prior to be vague. 
We adopt a gamma prior for r) with mean 3 and standard deviation \f?> ~ 1.73. Close to zero 
this prior density is proportional with rj 2 , and our experience is that this is sufficiently small 
to avoid a very large number of neighbours when simulating from a posterior distribution. 

When defining the factor p(Mi|M 2 ) one should recall that one has only two possibilities 
for Mi, either Mi = 0 or Mi = {□}. Moreover, as we have assumed M to be dense one 
must have Mi = {□} if card(M 2 ) > 0. Thereby we only need to specify prior probabilities 
p(Mi = 0|M 2 = 0) and p(Mi = {n}|M 2 = 0). We assume each of these probabilities to be a 
half, and get 



The conditional prior distribution p({Mk, k > 3}|Mi, M 2 ) is the probability for including 
clique types of orders larger than two, given Mi and M 2 and under the restriction that M 
should be dense. In particular we construct the rest of M by adding sequentially the possible 
interactions of orders 3,4 and so forth, each with a probability p*. Given the second order 
interactions M 2 we first independently add the possible third order interactions with the 
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Figure 3: Maximal possible DAG when M 2 = {^b,0,m, J}. 


probability p*. Then, given the set of third order interactions M 3 we independently add the 
possible forth order interactions with the same probability p*, and so forth. For instance, 
there are four possible third order interactions to add when the second order interactions 
are as given in Figured! namely [f| l= B and ff 3 - If, in this example, all of the third order 
cliques are included it is also possible to include the forth order clique EJ], which gives us the 
maximal possible DAG for the M 2 in this example, see the illustration in Figure [3j For the 
M illustrated in Figured! we see that only one of the third order cliques is included in M, 
and then clearly no forth order clique can be included without violating the restriction that 
M should be dense. Defining v k {M k - 1 ) = (A 6 L; r(A) = k and A* -< A for all A* G M k -i} 
to be the set of all clique types of order k that is possible to include in M given the clique 
types of order fc — 1 we get that 

p({M k , k > 3}|M 1; M 2 ) = ll I ll [p' (AGM *)(l - p^(M^)] l 

fc>3 J 

when M is dense. As a prior for p* we assing a uniform distribution on the interval (0,1). 

Having defined the prior for M as above, the next step in the prior specification is to 

define a prior that specify what sets of clique types that should have the same parameter 

value. We start by defining a partition of M, denoted J/*', and use S G 5? to denote a cell in 
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that partition. We restrict all clique types in a cell S to have the same parameter value, ps 
say. By setting z = {(S', ps)] S G 5^} we thus have 

<p a = Y. rsHAcS) 

(S,tps)£z 

for A G M, and we denote the vector of model parameters by p = (ps', S G 5?). In the 
following we define a prior for {< f > A , A G M} given M, by specifying a prior for z given M. 
We specify this conditional prior for 2 as 


p(z\M) = p(S fi \M)p(<p\S fi ), 


where p{S^\M) is a distribution over all partitions of M, and p(p\5?) is a distribution for 
the parameters values given a partition 5?. 

For p(S?\M) we first want to have a distribution that does not favour partitions with a 
high number of cells, as this would give a corresponding high number of model parameters. 
For instance, assigning a uniform distribution directly on all the possible partitions would 
make the marginal probability for approximately card(M)/2 cells much higher than a state 
with only one cell. Second, we want the prior probability for any particular partition with 
card( e 5^) = r cells to be smaller than any particular partition with r + 1 cells. To see how 
one can construct a prior fulfilling these requirements one should note that it is possible 
to count the number of possible partitions of a set with card(M) elements, noting that the 


definition of a parti 


ion require no cells to be empty. This is known as the Stirling numbers 


of the second kind (R onald L. Graham 198811 , and is for card(<5^) = r given by 


g(card(M), r) = - \ ) (-l) r - fc fc card(M) . 

r ' k =0 ' J 


The function g(card(M), r) is strictly increasing as a function of r from r = 1 to r = r max = 
argmax g(card(M), r) « card(M)/2, and strictly decreasing from r = r max to r = card(M). 

r 

We define p(S^\M) by a marginal distribution for the number of cells in the partition, 
p(card(^)|M), and given the number of cells in 5? we assume a uniform distribution over 
all partitions with the specified number of cells. Thus, we have 

p(card(^)| M) 


p(y\M) = 


g(card(M), card(^)) ’ 
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and use p(caxd(y)\M) to ensure that the prior fulfils the two properties discussed above. 
We define 


p(card(^) = r\M) oc 


g(card(M),r) 


if r < r max , 
otherwise, 


9 (card (M) ,r max ) 2 r “ rmax 

where the factor 2 r in the denominator grows faster than g(card(M), r) as a function of r. 
One should also note that the expression for r > r max is chosen so that it is equal to 1 for 
r = r max . Finally we specify by assuming the components of ip to be independent 

and normally distributed with zero mean and with some common variance cr 2 , but to make 


the model identifiable we add the restriction 


vs = °- 

(S,tp s )ez 

To complete our fully Bayesian setup we assign a vague gamma distribution for a 2 with 
mean 10 and variance 10 2 . 


6 Posterior simulation 


In this section we briefly present how we simulate from the posterior distribution 


p(z, M, 9\x) cx p(x\z, M)p(z\M, 9)p(M\9)p(9), 


( 12 ) 


where p(x\z, M) is a likelihood MRF for an assumed observed scene x, p(z\M, 9) and p(M\9) 
are the priors dehned in the previous section, and p{9) is the hyper-prior for the vector 
of prior parameters 9 = (rj,p*, cr 2 ). When wanting to simulate from this distribution, two 
challenges arises. Firstl y, as als o briefly discussed in the introduction, this is a doubly 


intractable distribution (Murray et al 


20061 ). So in a Metropolis-Hastings algorithm the 


normalising constant for the MRF does not cancel in the expression for the acceptance 
probability. Secondly, we need to specify appropriate proposal distributions for z, M and 
the prior parameter vector 9 = (r),p+, cr 2 ). In the following two sections we describe how we 
tackle these two problems. 
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6.1 Handling the normalising constant 


As in our case a MRF x is usually defined conditioned on a parameter vector. The posterior 
distribution for such models are particularly computer intensive because the normalising 
constant becomes intractable as the dimension of x increas es, se e (pQ) . This is in the literature 
known as a doubly intractable distribution (IMurrav et ah 20061 ). and great efforts have been 
made in order to circumvent exact evaluations of the normalising constants. These techniques 
can roughly be divided into three approaches. The first is to introduce auxiliary variables 
in the Bayesian model in such a way that the resulting acceptance probabilities would be 
without any evalu atio n of the normalising constant. These approaches are discussed in 


Mqller et al. 


(120061 ) and Murray et all (120061) . However to reach the true li miting distribution 


in such MCMC algorithms, we are required to simulate perfect samples (IPropp and Wilson 


1996 ) from the likelihood, which in turn can be very computationally demanding or eve n 


practically infeasible. If perfect simulation from the likelihood is infeasible 


Caimo and Friel 


(2011() and 


Everitt 


(2012) propose to replace the perfect sample with a sample generated by 


MCMC. Such an approach will not exactly target the correct posterior distribution, but the 
experience is that in many situations such a procedure provides a very good approximation to 
what we want. It should also be noted that the value of the parameter vector will typically 
restrict this approach as well, as the MCMC sampler for the auxiliary variable needs to 
converge for each sample. 

A second approach for coping with the computationally intractable normalising constant 


is to replace it by an estimate obtained prior to posterior simulation. 


obbtined by an MCMC a 


( 1992 ) and iHigdon et ah 


gorithm, which is also the approach taken in 


'he estimate is usually 


Geyer and Thompson 


( 1997 ). This approach will however not give the correct limiting 


distribution, and for the approach to be practical it usually requires the dimension of the 
parameter vector to be low. The third option for handling the intractable normalising con¬ 
stant is to replace the likelihood by an approximation, and thereby instead simulate from 
an approximation to the posterior distribution. Several approximation schemes have bee n 


developed, including approximation 
reduced dependency approximation 


3 V pseudo- 


Friel et al. 


ikelihood (Heik kinen and Hogmander 


2009 


McGrpry et al 


19941) . 


2012 J), and approxima- 
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tion by pseudo-Boolean functions flAustad 


2011 


Tj elmelan d and Aust ad 2012). As with the 


second approach we will not get the true limiting distribution using approximations, however 
there are typically no computational restrictions on the dimension and value of the parameter 
vector, although the quality of the approximation may depend on these quantities. 

In our situation we have an MRF with a parameter vector of possibly a somewhat high 
dimension, and in fact we have also put a prior on its dimension, so the second approach is 
not a viable alternative. We have tried both the first approach, with an MCMC sample to 
replace the perfect sample, and the third approach using pseudo-Boolean approximations. 
For our examples we found the first approach to be computationally cheaper and sufficiently 
accurate, so in our simulation examples we present results when adopting this approach. 


6.2 Proposal distributions 

In this section we only give a brief description of the proposal distributions we have used 

when simulating from the posterior distribution. A more detailed description is given in 

the supplementary material to this paper, see Section S.l. In our algorithm we combine six 

proposal distributions making changes in z or M, and three proposal distributions modifying 

the hyper-parameter vector 9. Each of the proposals for z or M is made according to 

a proposal probability vector p Q = (p„, p 2 a , p% p% Po, pt ), where each of these probabilities 

correspond to the order in which we discuss the proposals here. Our Erst proposal is a 

random walk proposal for the ip parameters, adjusting for the sum-to-zero constraint. Our 

second proposal is to take one clique type in a given cell and propose to change its cell 

membership. In the third proposal we propose to decrease or increase the number of cells 

by one, changing the dimension of the parameter vector ip correspondingly. As a forth 

proposal we propose to replace a clique type that is on by a different clique type, copying 

the parameter value and cell membership from the old clique type to the new one. In the 

fifth proposal we propose to turn on a new clique type or turn off an already existing clique 

type. When proposing to turn on a new clique type we propose to insert the new clique 

type into an already existing cell, and when proposing to turn off an existing clique type we 

choose to delete a clique type from a cell with size larger than one, preserving the dimension 

of the parameter vector in both cases. In the sixth and last of the proposals for z and M, we 
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propose to turn on a clique type and assigning it to it a new cell, or to turn off a clique type 
that is the only clique type in its cell. This proposal will also result in a dimension change in 
the parameter vector <p. Note that we allow some of these proposals to lead to states of zero 
prior probability. However for such proposals we do not need to evaluate the likelihood, so 
such proposals are very cheap to evaluate. The first three of the proposals described above 
make changes to already existing parameter values or to the cells of the partition. The last 
three proposals also makes changes to M, i.e. to what clique types that are on or off. In 
the supplementary materials we give the proposals we use when updating the elements of 
6. These proposals are cheap to evaluate and we do one of each of these for each time we 
perform one of the six proposals described above. 


7 Simulation examples 

In this section we investigate our prior and proposal distributions on three examples. Firstly 
we consider a simulation example where data is generated from an Ising model, secondly 
we generate data from a model with 2x2 cliques, and thirdly we look at a real data 
example consisting of presence and absence of red deer. In all our simulation s we evaluate 
the acceptance probabilities by using the exchange algorithm of [Murray et al. (120061) with a 


Gibbs sampler to obtain the auxiliary variables (ICaimo and Friel 


20111. The length of the 


burn in period of the Gibbs sampler are in all simulations set to 200 iterations, although 
experiments suggest that only about half of is sufficient in all our examples. We set p Q = 
(0.1,0.15,0.15,0.15.0.20,0.25) for the two first examples and a slightly different p Q for the 
last example, see Section 17.31 For the tuning parameters described in the supplementary 
materials, see Section S.l, we set a w = 0.2, a g = cr c = 0.3, a v = 2.0, a a = 0.7 and p = 0.1, 
based on some preliminary runs. 


7.1 The Ising model 


In this section we used a perfect sampler (Propp and Wilson 19961 ) to generate one realisation 


from an Ising model on a 100 x 100 lattice with torus boundary conditions. The Ising model 
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Figure 4: Ising example: Data simulated from the Ising model in (1T31) . 


used is given by 


p{x) = exp | 0.4 E J ( 



( 13 ) 


where the sum is over all pairs of first order neighbour nodes, taking the torus boundary 


assumption into account. The realisation used is shown in Figure Q] This Ising model 
corresponds to the set of clique types M = {{0},n,m.0} with the partition either being y = 
{{{0}}, {□}, {m.0}} or y = {{{0}}, {□}, {m}, {0}}. The true parameter values for these two 
model formulations are tp{{®}} = 1.3333, </?{□} = —0.2666 ThmQ} = —1.0666, and <P{{®}} = 1.6, 
^{□l = Oi ^{m} = — 0 . 8 , ^>0 = — 0.8 respectively. To simulate from the posterior in (TT 2 |) 
we ran our sampler for 10 6 iterations, and based on convergence plots we could see that the 
MCMC sampler had converged after about 10 4 iterations. The acceptance rates for each 
of the proposals changing z and M were for this example 0.102,10 -5 , 0.002,10~ 4 ,10 ~ 4 and 
0.001, given in the same order as in p a . The reason for the very low figures for some of 
these acceptance rates is in fact not slow mixing, but just that the posterior distribution has 
essentially all its mass in one state for M and y. The acceptance rates for the three updates 
of r], p * and (J~p were 0.278, 0.499 and 0.462 respectively. 

Discarding the burn-in period of 10 4 iteration, we now investigate the posterior distri¬ 
bution of M, y, the parameter values 99 , and the hyper-parameters 77 , p* and a For M 
the estimated posterior probability of the correct state becomes 0.999, so the sampler just 
very briefly visit other states. If we consider only the realisations that consist of the correct 
state of M , there are only two partitions that are simulated, namely the two partitions given 
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above. The three and four cell partitions have estimated posterior probabilities equal to 
0.991 and 0.009 respectively, so the model with less parameters is preferred. 


To evaluate the posterior distribution for the parameters is complicated when simulating 
from a distribution where not only the parameter values, but also the dimension and inter- 

Celeux et al. (2000). In 


pretation of the parameters vary, see for example the discussion in 
our model the interpretation of both <p and {</> A , A G M } vary with M and y, so to study 
the posterior distribution of the elements of these vectors across different states for M and 
y is of no value. Recalling that (j> A is well defined also when A is off, A ^ M, we instead 
focus on cf) A — 0W, A E L \ {{0}} which has the same interpretation for all M and y. From 
(j3j) we see that 4> A — q 6® = [/(1 A ) — f/(l 0 ) for any A G A. We in particular choose to focus 
on d> A — 


for A G {□, m, F|, ^h, |~P, FP, Ep, p)-|, rH , FRI • Probability histograms of the simulated 
values for these ten statistics are shown in Figure [51 The corresponding values for the Ising 
model in (fl3|i that we used to generate the data set we are analysing now are shown as black 
boxes on the x-axis, and we see in fact that the correct 0 A — values are well recovered. 

For the prior parameters r /, p*, and cr^ we get the results shown in Figure |6l The median 
of the posterior samples of // is 1.62 with 95% credible interval (0.931,3.331). The posterior 
distribution for p* is essentially uniform, which is reasonable since there is no possibility for 
higher order interactions in most of the posterior realisations and thereby this parameter is 
in most of the iterations just simulated from the prior. The median of a ^ is 3.857 in this 
example. 


7.2 A 2 x 2 maximal clique model 

In the next simulation experiment we generated a data set from an MRF with 2x2 maximal 

cliques and M as shown in Figure [3] and y and parameter values as shown in Table [l] Just 

as for the Ising model discussed above there are of course several possibilities for y specifying 

the model we are using, so in Table [T| we give y with the lowest number of elements. We 

also give the corresponding values of cj) A — 0W for A G L\{{0}} as the interpretation of these 

perhaps is easier, and because we want to compare these values with corresponding values 

for the posterior samples. We were not able to obtain a perfect sample by coupling from the 

past for this model, so we generated instead a realisation by a Gibbs sampling algorithm, 
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(h) </Eb - 0® 


(i) 0tfU 0® 



-33 34 33 3.0 


(j) <?£EI- 0® 

Figure 5: Ising example: Probability histograms for the posterior simulated values for 
( f> A — 0® for ten values of A. True values are shown with a black box. 


and the generated data is shown in Figure[?l This data set contains both small areas with 
clustering of black or white nodes and small areas with a checkerboard pattern. 

Also for this example we simulate from the posterior distribution by running our sam¬ 
pler for 10 6 iterations. Based on convergence plots we concluded that the sampler had 
converged after about 10 5 iterations. The acceptance rates for the six proposals for z 
and M were 0.062, 0.022, 0.023,10 -4 ,10~ 4 and 10~ 4 , again given in the same order as in 
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(a) rj (0.931,3.331) (b) p* (0.025,0.975) (c) (0.665,21.694) 

Figure 6: Ising example: Probability histograms for the hyper-parameters rj, p+ and cr^, 
with 95% credible intervals. 


AeL\{{0}} 

(j) A - <f>w 

S e y 

VS 

□ 

-0.10 

{{0}} 

0.4667 

m, B 

-0.75 

H 

0.3667 

W 

0.20 

{rn, B} 

-0.2833 

cB Eb cB 

-1.05 

fb-cP) 

0.6667 

ffl 

-1.10 

IB 3 > B=l > Bd > cB} 

-0.5833 



m 

-0.6333 


Table 1: MRF example with 2x2 cliques: True parameter setting with associated (minimal) 
partition. 



Figure 7: MRF example with 2x2 cliques: Data simulated from the 2x2 model with 
parameters given in Table [1] 

p 0 , while the acceptance rates for rj, p * and cb; was 0.203, 0.333 and 0.347, respectively. 
As in the previous example we investigate the posterior distribution of M, 5?, (j) A — (jffi 
for A e {n^B^cRER^BicBEB}, and 9 = (r 7 ,p*,(T^). For M we hnd that the cor¬ 
rect model are simulated with probability 0.99. The estimated most probable partition, 

however, differ slightly from the true model. The aposteriori most probable partition is 
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P(y\M) 

y 

card(y) 

0.223 

{{{«},[?}, {□,%}, {rnffi. ff.m.cflffl}} 

4 

0.091 

{{{»},cf}. {□.%}, {m.0}_ {ff.iBcflffl}, ft}} 

5 

0.089 

{{{»},ty. {□}, m, {m.0}, 

5 

0.049 

{{{»}.□}. m. m. {m0}, 

5 

0.046 

{{{«}.□}, ftj J}, {m.0}, {F.iBfijcflffl}} 

4 

0.023 

{{{«}}■ {□}, W. Wt {m.0}, {ff.tB&jcflffl}} 

6 

0.022 

{{{0},cP}, {□■%}■ {m.0}. {cfl}} 

5 

0.021 

{{{»},cP}. {□■%}. 

5 


Table 2: MRF example with 2x2 cliques: The aposteriori ten most probable partitions. 


& = {{{0},cP}, {□,%}, which has probability 0.223. This parti¬ 

tion differs from the true partition in that the cell {^bcP} is removed and ^ and J are 
instead inserted into the cells {{0}} and {□}, respectively, and in that the two cells {EJ]} and 
(ER REL Eb cB} in the true partition are merged into one cell. In total eight partitions have 
probabilities larger than 0.02, and these are shown in Table [2j Additional ten partitions 
have posterior probabilities larger that 0.01 (not shown). After the burn-in, the MCMC 
run visited a total 370 partitions for the most probable state of M, and the partition used 
to generate the data set is the aposteriori 75th most probable partition, with an estimated 
posterior probability of approximately 10^ 4 . Looking at the most probable partitions given 
in Table [2] we notice that except one, all of these partitions have one or two cells less than 
that of the true partition. Because of the relatively week interactions in our data set the 
posterior prefers partitions corresponding to a lower number of parameters. However, as 
shown in Figure [HJ the true </> A — values are still not completely missed by the posterior. 

The multi-modality of the probability histograms for (j) A — q can be understood. For 


example the spike at zero in Figure 8(a) is due to □ and {0} being assigned to the same cell. 
Similarly the spikes at zero in the distributions for ^ and J are the result of these clique 
types being assigned in a cell with {0}. Because of the weak interactions in the data, the 
tendency to have too few cells in the partition, compared to the true solution, makes the 
posterior samples often miss the true value of (fi A — However one can argue that this is 
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(j) <?£ EI - </> 0 

Figure 8: MRF example with 2x2 cliques: Probability histograms for the posterior simu¬ 
lated values for 0 A — for ten values of A. True values are shown with a black box. 


done without a dramatic decrease in the likelihood. This last claim can be investigated for 
instance by simulating from the likelihood using some of the model and associated parameters 
produced by the posterior in the MCMC run. Three such simulations from the likelihood for 
three different parameter settings are shown in Figure [91 where the last of these realisations 
is from a parameter setting from the most probable partition given in Table [2] As we can see 

the same three phase patterns as in the original data is present in all of these simulations. 
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Figure 9: MRF example with 2x2 cliques: Simulations for three parameter realisation. 



(a) r) (0.767,2.029) (b) p* (0.216,0.994) (c) a% (0.210,10.437) 

Figure 10: MRF example with 2x2 cliques: Prior-parameters with 95% credible interval. 

In addition the strength of the interaction between the nodes seems similar as well. 

For the prior parameters r/, p*, and cr^ we get the posterior probability histograms shown 
in Figure ITOl The posterior median for r] is 1.207 in this example. The estimated posterior 
median for p* is 0.794, which is reasonable given that this represents the probability for 
turning on clique types A where r(A) > 2. Thus, the shape of the posterior distribution 
have higher probability for values for p* that favours higher order clique types to be on. The 
estimated median for cr^ is equal to 1.133 in this example. 


7.3 Red deer example 

In this section we consider a data set consisting of the classes absence(O) and presence(l) of 


the data set is given in lAugustin et al. 


set has previously been analysed in 


al. 

(1996 

) and 

Buckland and Elston 

(19931). and the data 

Tielmeland and Austad 

(2012 

) and P 

uuiesen and Tielmeland 


(2 0 151). Our focus is to analyse the spatial interaction in the observed values by adopting 
the prior specified in Section [5] 
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Figure 11: Red deer example: The presence/absence of red deer (left), altitude (middle), 
and mires (right) in the Grampians Region of north-east Scotland. 


The observed values do not lie on a rectangular lattice, and to cope with this we include 
the observed x into a larger lattice xi = (x, Xb) such that xi is a rectangular lattice where 
every element of x is separated from the boundary by at least one node in Xb- For this 
extended lattice we then assume torus boundary conditions. In simulation experiments not 
discussed here we have also tried to increase the size of the extended lattice, but we then 
obtained essentially the same results as discussed below. 

In addition to the binary values shown in Figure fill we also have four covariates available 
in each of the nodes, namely altitude, mires, and north and east coordinates. We denote the 
covariate k at a location (i,j) by yij,k, j = 1, 2, 3,4, and model them into the likelihood by 

p(xi\z,M,K,y) = Zexp I U(x t \z,M) + KkVi,j,k , 

\ (*j)ex k =l / 

where U(xi\z, M ) is an energy function modelling spatial interaction as discussed in Sections 
|4j and k = (ki,...,k 4 ) are parameters associated to each of the four covariates. On z and 
M, and the associated hyper-parameter 9, we assign a prior as discussed in Section [5] and, 
independent of z, M and 9, we adopt independent zero mean Gaussian priors with standard 
deviation equal to 10 on /c 4 . 

The resulting posterior distribution of interest is 

p(z, M, 9 , k, x b \x) oc p(xi\z, M, k, y)p(z\M, 9)p(M\9)p(9)p(n). 

When simulating from this distribution we let each iteration consist of one of the six updates 
for z and M discussed in 16.21 or one update of a randomly chosen element in k, a Gibbs scan 
for the elements in Xb, and updates of the elements in 9 as discussed in Section [6721 Therefore 










we increase the dimension of p a , in this example, to include updates of the elements in k 
as well. We set p Q = (0.1,0.1,0.15,0.15,0.20,0.25,0.05) where the first six entries are the 
probabilities as defined before and the last entry is the probability to propose a change for 
one of the elements in k. To generate a potential new value for a chosen element in k we 
add to the current value a zero mean normally distributed random variable with standard 
deviation 0.2. For this example we ran the sampler for 10 8 iterations, and investigation of 
convergence plots suggested convergence after about 10 5 iterations. The acceptance rates 
for the proposals of z, M and k were 0.207,0.026,0.043,0.005,1CT 3 , 0.002 and 0.268, where 
these rates are given in the same order as in p a , and in particular the last acceptance rate is 
for the proposal of the covariates k. The acceptance rates for p, p* and p was 0.212, 0.349 
and 0.469 respectively in this example. 

Again we investigate the posterior distribution of z and M after convergence. In total the 
chain visited more than 10 000 different states of M after convergence, and there are nine 
models with estimated posterior probability larger that 0.02, all given in Table [3] Together 
with posterior probabilities for M and the most probable partition under M, the associated 
neighbourhoods are also shown in this table. Note that all the neighbourhoods in this table 
suggest some anisotropy from up-left to down-right. In addition to the nine states in Table 
0 12 additional states for M have probabilities larger that 0.01. 

The estimated distribution for </> A — q 4® for the same As as considered in the simulation 
examples, and in addition for A e { n q^}, are shown in Figure [T2] Several of these 
histograms are bimodal. In Section S.2 in the supplementary material we investigate this 
bimodality more in detail. In particular we show that the bimodality can be explained by 
splitting the posterior realisations into two groups. If we consider only realisations under 
the most probable posterior set of clique types M = {{0},nm,0. CI h? I= B?Bzi }5 see Tabled we 
get unimodal distributions for the </> A values where its mode belongs to the first mode in 
Figures fl2la). (e), (k) and (1), and to the second mode in the rest of these plots. The second 
group of realisations consists of the remaining realisations, and in fact the distribution of 
( t> A for this group is again bimodal for some 0 A . This bimodality can in turn by explained 
by looking at the realisations separately for </> A when A is either on or off, again see Section 
S.2 in the supplementary materials for details. It is also worth mentioning that most of the 
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Table 3: Red deer example: The realisations of M with posterior probability higher than 
0.02. First column: Estimated posterior probability of a specific M. Second column: Parti¬ 
tion J^ax with the highest posterior probability for the given M. Third column: Estimated 
posterior probability for y m „ given M. Last column: Neighbours (shown with x) for a 
node (shown with ■) resulting from M. 

most probable set of clique types with posterior probability smaller than 0.02, i.e. the most 
probable realisations of M not shown i Table |3l are states containing different combinations 
of the second order clique types m, 0. I= 0, n □ and which are the ones also used for the 
most probable set of clique types given in Table [3j In fact, the posterior probability that M 
is a subset of {{0},n,m.0. l= 0, n □• D n } but not equal to {{0}} or {{0}.n} is as high as 0.307. 
Also, the posterior probability that at least one clique type from the set {m. B^ti, D D D n } is 
included in M is very close to unity. In other words, these second order clique types seem 
to be the building blocks of the dependency structure in this data set. 
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Figure 12: Red deer example: Probability histograms for the posterior simulated values 
for cj) A — cj)^ for twelve values of A. 


To investigate the fit of the likelihood to the data we did the following small experiment, 
similar to what we did for the 2x2 clique MRF example. For three randomly chosen 
realisations of M, z, Xb and n we simulated from the likelihood three realisations of x. The 
results are shown in Figure [13j and we see that these realisation, at least visually, seem to 
have the same spatial dependence structure as the data in Figure fill 

Figure fl4l shows posterior probability histograms for the prior-parameters p, p * and 

31 















































































































































































































































Figure 13: Red deer example: Simulations for three parameter realisation. 



(a) r) (0.739,2.357) (b) p„ (0.034,0.978) (c) <v (2.143,29.080) 

Figure 14: Red deer example: Prior-parameters with 95% credible interval. 


The median of r] is in this example 1.255. The median of p* is 0.538, and as we can see 
from the estimated posterior distribution of p* we slightly prefer models where interactions 
of order higher than two are turned on. The median of oy ,2 is 8.058 for this example. Lastly 
the estimated posterior distributions of the covariate parameters with 95% credible interval 
are given i Figure HU As we can see all of these have credible intervals that does not include 
zero, supporting the need to include them in the model. 

8 Closing remarks 

In this paper we propose a prior distribution for the dependence structure of a stationary 
binary MRF. Our prior is formulated in three levels. First we define a prior for the neigh¬ 
bourhood system. Next, given the neighbourhood system, and thereby the set of cliques, 
we formulate a prior for the parametric form of the potential functions. In particular this 
includes a prior for whether higher-order interactions should be included in the MRF, and 
non-zero prior probabilities are assigned for the event that groups of parameters have exactly 
the same value. The latter serves both as a way to reduce the effective number of param¬ 
eters and to allow properties like symmetry and isotropy. Lastly, given the neighbourhood 
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-0.8 -0.6 -0.4 -0.2 0.0 -0.6 -0.4 -0.2 0.0 


(a) ki (-0.549,-0.146) 


(b) k 2 (-0.441,-0.058) 



(c) k 3 (-0.266,-0.002) (d) k 4 (-0.546,-0.164) 

Figure 15: Red deer example: Estimated marginal posterior distributions for the parame¬ 
ters of the covariates. Estimated 95% credible interval is given for each parameter. 


system and parametric potential functions we define a prior for the parameter values. To our 
knowledge this is the first time a prior is formulated for all three levels in the dependency 
structure specification of an MRF. In particular it is thereby the first attempt to learn the 
neighbourhood system of an MRF from data via a fully Bayesian approach. 

To sample from the resulting posterior distribution we adopt an R.JMCMC approach, and 
in particular we formulate proposal distributions that can be used in such an algorithm. To 
cope with the computationally intractable normalising constant we adopt an auxiliar y var i- 


Everitt 


able algorithm when computing acceptance probabilities, similar to what is done in 

( 2012 ). 

We present three examples of data in this paper. Two examples with simulated data, one 
from the Ising model and one from an MRF with 2x2 maximal cliques. Both the dependency 
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structure and parameter values in these two examples are reasonably well recovered. The last 
example is a data set of presence/absence of red deer in a region in Scotland. The results 
show a clear spatial dependency within the data. In particular the spatial dependency 
involves higher order interaction, which demonstrates the importance of exploring higher 
order interaction MRFs also for data sets where the spatial dependency is week. Clearly 
more experience with our proposed prior is needed, both on synthetic and real world data 
sets. 

In the current article we restrict ourselves to consider only binary data. However a gen¬ 
eralisation to the discrete case is straightforward. The bottleneck of this approach is clearly 
computation time, as we need to simulate an MCMC sample from the likelihood for each pro¬ 
posed change in the parameter vector. A possible strategy to reduce this problem would be to 


replace the MRF with a part ial 


y ore 


ered Markov model ( POMM), see 


Qian and Titterington 


( 1991 b ICressie and Davidsonl ( 1998 1 and iToftakeri ( 2013 1. The POMM is known to be com¬ 


putationally superior to MRF, and we are currently exploring such an option. 

Another important direction for future research is to improve the proposal distributions 
in order to obtain better convergence and mixing of the MCMC algorithm, which in turn also 
would improve the overall computation time. We believe that better proposal distributions 
would help the MCMC algorithm to converge also for more complex data sets than what we 
have presented here. With our current proposal distributions, we have in particular found it 
hard to get satisfactory convergence and mixing properties for data sets with strong spatial 
dependence. One improvement that we believe would help the convergence is in how to 
propose new parameter values when adding or removing a parameter. For example, when 
proposing to add a new parameter in our current algorithm we just propose a new value 
for this new parameter and keep all the other parameter values unchanged, except for the 
sum-to-zero adjustment. A better alternative would be to propose a change also in the other 
parameter values so that the new model is closer, in some sense, to the current model. Fixing 
the value of the potential new parameter, one possibility would be to define the potential 
new values of the remaining parameters by defining the potential new model as a least 
squares approximation of the current model. This strategy has strong similarities with an 
approximation technique used in Austadl (2 01111 . but the results there can not directly be 
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used in our setting because we have adopted a partition of M. 


Supporting Information 

Additional Supporting Information may be found in the online version of this article: 

Section S.l: Details for the MCMC sampling algorithm. 

Section S.2: Investigating the modality in the red deer example. 
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5.1 Details for the MCMC sampling algorithm 

In this section we provide details of the proposal distributions that we use when sampling 
from the posterior distribution 

p(z, M , 9\x) oc p(x\z, M)p(z\M, 9)p(M\9)p(9), 

where p{x\z, M ) is the likelihood MRF and p(z\M, 9), p(M\9) and p(9) are the priors defined 
in Section 5 in the paper. All of the proposal distributions are also briefly discussed in 
Section 6 in the paper. The proposals given in Sections IS. 1.31 and IS. 1.61 are transdimensional 
proposals, resulting in reversible jump Markov chain Monte Carlo (RJMCMC) updates. Note 
that all proposals cannot be applied to all states, or that some of these proposals could result 
in non-reversible states. For these cases we simply regard the proposal as a reject. In this 
section we use S, S * and S ** to denote specific elements of the partition 5?. 

5.1.1 Random walk for parameter updates 

Assume the current state to be z = {(S', <ps)', S E The first proposal is simply a random 
walk proposal for the tp parameters, but correcting the update so that the parameter values 
still sum to zero. We do this by first uniformly at random selecting one of the <p parameters, 
ps* f° r S* G y say, and add to it a random change £ ~ N( 0, cr^), where a w is an algorithmic 
tuning parameter. Next, for the proposed new parameter values to comply with the sum- 
to-zero constraint, we subtract the same value from all the parameters. The potential new 
state 5 becomes 


1 








5.1.2 Switching cell for one clique type 

Assume again the current state to be z = {(S', ips)] S G 5?}. Note that this proposal is not 
possible if the number of cells equals one or all the cells contains only one clique type each. 
To generate a potential new state we first sample a pair of distinct cells S *, S ** G 5? from 
the distribution 

f exp { — ~ <^s**) 2 } if S* 7 ^ S** and card(S 1 *) > 1, 

q(o , S ) oc < 

I 0 otherwise. 

Thus, we have that the number of clique types in S* is strictly larger than one, and pairs 
S*, S ** where the corresponding parameter values are similar have a higher probability to be 
sampled. Given S* and S ** the potential new state is formed by drawing at random a clique 
type A G S'* and moving A over to S**. The potential new state thereby becomes 

z = {(S,<p s );Sey\ {S *, 5**}} U {(5* \ {A}, ^), (S** U {A}, 

5.1.3 To propose a change in the partition 

Again let the current state be z = {(S', (ps)] S G 5?}. In this proposal we propose to increase 
or decrease the number of cells by one, with probability a half for each. We refer to these 
proposals as split and merge proposals, respectively. We start by explaining how to propose 
to increase the number of cells by one (split), before we explain the opposite move (merge). 
Note that the spilt proposal is not possible if the number of cells equals the number of clique 
types, and that the merge proposal is not possible if there is no cell with only one clique 
type or if M = {{0}}. 

To propose to increase the number of cells by one we start by drawing uniformly at 
random one of the cells where the number of clique types in the cell are larger than one. 
Assuming cell S* is sampled, we next uniformly at random draw one of the clique types in 
this cell, A G S'* say, and use this to form a new cell including only this clique type, i.e. {A}. 
This new cell must be assigned an associated parameter value, and to do this we add 

an e ~ N(0, CTg), where a 2 g is an algorithmic tuning parameter, to the old parameter value 

for A, ips*. Finally, for the potential new parameter values to comply with the sum-to-zero 
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constraint, we subtract the same value from all the parameters, so that the potential new 
state becomes 


z = 


S, Vs ~ 


Vs + £ 


S*\{A},vs*~ 


caid(y') + 1 

Vs* + £ 


■,s eY\ {5*} } u 


{A},<^5* + £~ 


Vs* + g 
card (S?) + 1 


caid(y') + 1 

To propose to decrease the number of cells, we can only consider to merge a cell, S* say, 
consisting of only one clique type, i.e. card(5'*) = 1, into another cell, S'** say, in order for 
this proposal to conform with the split move specified above. We therefore first sample two 
distinct cells S*, S** G 5? from the distribution 


q(S*,S* 


oc 


exp { —(ys* — y s **) 2 } if S* ^ S** and card(S*) = 1, 
0 otherwise. 


Given S* and S** the potential new state is formed by first merging S* into S** and thereafter 
correcting all parameter values with the same amount to obtain a proposal that comply with 
the sum-to-zero constraint. The potential new state then becomes 


5 = i S, vs + 


Vs* 


;5ey\ {S*, S**} U S*U S**, vs** + 


Vs* 


cavd(y') — 1 


card(^) — 1 / 

The split and merge proposals give a change in the dimension of the parameter space, 
and in the acceptance probability for these RJMCMC steps we then need to include a Jacobi 
determinant. For the split and merge proposals specified above it is straightforward to show 
that the Jacobi determinants become and respectively. 


S.1.4 To propose a change in M by replacing a clique type 

In this step we propose to replace one of the clique types that is on with another clique 

type. Note that this proposal is not possible if there are no clique types with r(A) > 1, or 

if the proposed clique type either consists of two equal nodes or is equal to a clique type 

already in M, in both cases giving a non-reversible state. Assuming the current state to be 

M and z — {(S', Vs)] S G ^}, we start this proposal by uniformly drawing one of the clique 

types A G M where r(A) >1. In the following we let A* denote the cell in 5? that contains 

A. Then consider an arbitrary element in A, A G A say, and draw at random one of the 
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elements in A, (i,j) say. We then generate a modified version of A by replacing (i,j) with 
one of the first order neighbours of (i,j). More formally, we first draw uniformly at random 
a ( k , l) from the set {(0,1), (1, 0), (0, n — 1), (m — 1, 0)}, and let the modified version of A be 
given by A* = (A \ {( i,j )}) U {(i, j) © (k, l)}. A corresponding modified version of A is then 
A* = {A* © (t,u); ( t,u ) G y}. The potential new state is then generated by just replacing 
A with A* in the definition of M and z. The potential new values of M and z becomes 
M = (M \ {A}) U {A*} and 5 = {(S', ip s )- S G ^ \ {S'*}} U {((5* \ {A}) U {A*}, 

S.1.5 To propose a change in M by adding/deleting a clique type 

Again let the current state be M and z = {(S,(fs)',S G 5?}. In this proposal we either 
add a clique type to M or delete a clique type from M, but keeping the same number of (p 
parameters. For the adding proposal note that a clique type that is already in M could be 
proposed or that the proposed clique type could consist of two equal nodes, in both cases 
giving a non-reversible state. Also, the deleting proposal is not possible if there are no cells 
with more than one clique type. First we draw at random whether to increase or decrease 
the number of clique types, with probability a half for each. In the following we first explain 
the increase proposal and thereafter the decrease proposal. 

We now explain how we propose to increase the number of clique types, keeping the same 
number of y parameters. We start by drawing uniformly at random one of the existing clique 
types in M, A say. A new clique type A* is then generated in three steps. First we take 
out an arbitrary element of A, A say, and draw uniformly at random a node (i,j) from the 
set of nodes in A. Secondly we draw uniformly at random a (k,l) G y \ {(0,0)} from the 
distribution 

q(k, l ) oc exp(-p(|fc| + \l\)) for (k, l) G y \ {(0, 0)}, 

where p is an algorithmic tuning parameter. Given (k, l ) a new modified version of A, A* is 
constructed by A* = A U © (k, l)}. Thirdly, a modified version of A, A* is constructed 

by A* = A U {A* © (t,u)] ( t,u ) G y}, and we generate the potential new value of M by 
M — MU {A*}. To generate a corresponding potential new value for z we first draw 
uniformly at random a cell S* from all the cells in and then let the potential new value 
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of z be given by 5 = {(S', ip s )\S G S' \ {S'*}} U {(S* U {A*}, 

Next we explain the proposal we make when the number of clique types is to be decreased 
by one, but keeping the same number of <p parameters. We start by uniformly drawing one of 
the clique types that belongs to a cell with more than one element. Let A denote the clique 
type we draw and let S* denote the cell that contains A. The potential new state is then 
generated simply by removing A from the current state, so that the potential new values 
for M and z become M — M\ {A} and z* = {(S, <ps); S G 5? \ {S*}} U {(S'* \ {A}, 92 s*)}, 
respectively. 


S.1.6 To propose a change in M and in the partition 


Let again the current state be M and z — {(S, ips)', S' G b^}. In this proposal we either add 
a clique type to M or delete a clique type from M, and increase or decrease the number of 
parameters correspondingly. For the adding proposal note again that a clique type that is 
already in M could be proposed or that the proposed clique type could consist of two equal 
nodes, in both cases giving a non-reversible state. Also, the deleting proposal is not possible 
if there are no cells with only one clique type or M — {{0}}. We first draw at random which 
of these two possible proposals to perform, with probability a half for each. I 11 the following 
we start by explaining the increasing proposal, and thereafter move on to the decreasing 
proposal. 

We start by generating a A* in exactly the same way as we did for the increase proposal 
in Section IS. 1.51 Next, we form a new cell containing only A*. To generate the associated 
parameter value we follow a strategy similar to what we did in Section IS. 1.31 We 

first add an £ ~ N(0, of), where a c is an algorithmic tuning parameter, to the current value 
(fr A *, and thereafter subtract the same value from all the ip parameters to get a proposal 
that comply with the sum-to-zero constraint. The proposed new values for M and z then 
becomes M — M U {A*} and 


z = { (5, <ps ~ 


+ € 


card( t y 7 ) + 1 


; S G 5? > U < ( {A*}, cj) A + e — 


+ £ 


card(^) + 1 


respectively. 

Next we explain how we generate a proposal where the number of clique types is decreased 
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by one, and the number of ip parameters is correspondingly reduced. We start by drawing 
uniformly at random one of the clique types that is the only member of a cell. Let A denote 
the one we sample and let S* = {A} denote the corresponding cell. This clique type and 
its corresponding cell is now simply deleted from the current state resulting in the proposed 
values M = M \ {A} and 

where we again has subtracted the same value from all the ip parameters to comply with 
the sum-to-zero constraint. As in Section IS. 1.31 it is straightforward to show that the Ja¬ 
cobi determinants for the increase and decrease proposals become ca ^^f). and ca T?. 

^ ^ card(,X)+l card^cX) —1 

respectively. 

S.1.7 Proposal for the model parameters 

In total we have three model parameters that we assign prior distributions to, namely 77 , 
p* and cry We make one proposal for each of these parameters for every iteration of the 
sampling algorithm. Firstly we update 77 by proposing a zero mean Gaussian change with 
standard deviation a v . Secondly we propose a new value for p* simply by sampling a new 
value from a uniform distribution on the interval (0,1). For the last parameter cG we propose 
a zero mean Gaussian change with standard deviation a a . 

S.2 The multi-modality in the red deer example 

In this section we provide a more detailed investigation of the multi-modality in the posterior 

distributions for </> A — 0W shown in Figure 12 in Section 7.3. As mentioned in the paper the 

modes can be explained by first splitting the posterior samples into two groups, depending on 

the value of M. The first group is all samples where M takes the aposteriori most probable 

value M max = { { 0 }, □, co, 0 , 0 ^}, and the second group is the rest of the samples, i.e. 

samples where M 7 ^ M max . From Table 3 we see that the posterior probabilities of these two 

groups are 0.284 and 0.716, respectively. Making individual histograms for the (f) values in 

these two groups give the results shown in the second and third rows in Figures fS. 2. lf[S22~4 

6 










which can be compared with the corresponding complete histogram in the first row. As we 
can see the results for M max are unimodal, whereas for M 7 ^ M max some of the histograms 
are still bimodal, see in particular the histograms for m, 0 . I= j^, n □ and '“j-j This bimodality 
can be explained by splitting the realisations in the group M ^ M max into two subgroups, 
one for realisations of </> A where A is on, and one for realisations of (j) A where A is off. The 
results for (j) A under this split is shown in the forth and fifth rows in Figures 1S.2.1HST2.41 
Note however that this spilt results in empty or almost empty sets of realisations of 0 A for 
some A, in which case we do not make a histogram (denoted NA in the figures). 
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Figure S.2.1: Red deer example: Histograms for d A for some A e L for different subgroups 
of the posterior samples. Results for different clique types A in each column, and different 
subgroups in each row. The conditions given in the first column are defining the different 
subgroups. 
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Figure S.2.2: Red deer example: Histograms for for some A e L for different subgroups 
of the posterior samples. Results for different clique types A in each column, and different 
subgroups in each row. The conditions given in the first column are defining the different 
subgroups. 
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A = C B 


A = ff 
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Figure S.2.3: Red deer example: Histograms for d A for some A <E L for different subgroups 
of the posterior samples. Results for different clique types A in each column, and different 
subgroups in each row. The conditions given in the first column are defining the different 
subgroups. 
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Figure S.2.4: Red deer example: Histograms for oi A for some A £ L for different subgroups 
of the posterior samples. Results for different clique types A in each column, and different 
subgroups in each row. The conditions given in the first column are defining the different 
subgroups. 





























